library(ggplot2)
library(ggcorrplot)
library(corrplot)
library(lessR)
library(car)
library(olsrr)

Reading in the full Dataset.

data <- read.csv(file="ames_housing_data.csv",header=TRUE,sep=",")

Data with the continuous variables only:

data_cont <- data[,c("LotFrontage","LotArea","MasVnrArea","BsmtFinSF1","BsmtFinSF2","TotalBsmtSF", "FirstFlrSF","SecondFlrSF","LowQualFinSF","GrLivArea","GarageArea","WoodDeckSF", "OpenPorchSF","EnclosedPorch", "ThreeSsnPorch", "ScreenPorch","PoolArea","MiscVal","SalePrice")]

Initial views indicate that GrLivArea may be the “best” continuous variable to start with. Its histogram shows a fairly “normal” distribution (with some outliers), and its correlation with sale price is the highest at 0.71. The QQ plot suggests it is one of the more normal-looking of all the continuous variables.
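The screening described above can be sketched with base R only. This is a hedged, self-contained sketch: if the Ames data frame is not already in the workspace, it simulates a stand-in with roughly the fitted slope and noise reported below, so the code runs on its own.

```r
# Stand-in data if ames_housing_data.csv has not been read in (assumption:
# the real analysis uses the `data` frame loaded above).
if (!is.data.frame(get0("data", ifnotfound = NULL))) {
  set.seed(1)
  data <- data.frame(GrLivArea = rnorm(500, 1500, 500))
  data$SalePrice <- 13000 + 112 * data$GrLivArea + rnorm(500, 0, 56000)
}
# The three checks mentioned: histogram, QQ plot, correlation with SalePrice.
hist(data$GrLivArea, breaks = 40, main = "GrLivArea", xlab = "Square feet")
qqnorm(data$GrLivArea); qqline(data$GrLivArea)
cor(data$GrLivArea, data$SalePrice)   # reported above as roughly 0.71
```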

Part A

model1 <-lm(SalePrice~GrLivArea, data=data_cont)
summary(model1)
## 
## Call:
## lm(formula = SalePrice ~ GrLivArea, data = data_cont)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -483467  -30219   -1966   22728  334323 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)
## (Intercept) 13289.634   3269.703   4.064            0.0000494
## GrLivArea     111.694      2.066  54.061 < 0.0000000000000002
## 
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared:  0.4995, Adjusted R-squared:  0.4994 
## F-statistic:  2923 on 1 and 2928 DF,  p-value: < 0.00000000000000022

a.

Make a scatterplot of Y and X, and overlay the regression line on the cloud of data.

ggplot(data_cont, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(col = "purple") +
  geom_smooth(method = "lm", fill = "red") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Housing Price", x = "Above Ground Living Area", y = "Sale Price")

b


The equation for the linear regression using above-ground living area is \(Y = 13289.634 + 111.694*X_{1}\). The intercept, 13289.634, says that if the above-ground living area is 0, the predicted housing price is $13,289.63. This may or may not make sense: a house with 0 square feet is not logical, but it could represent the price of the lot (the land) alone.
\(B_{1}\) is 111.694, which indicates that for every one-square-foot increase, the predicted price increases by about $111.70.
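As a quick check of this interpretation, the fitted equation can be evaluated by hand. The coefficients are taken from the summary above; the 1,500 sq ft home is a hypothetical example, not from the data.

```r
b0 <- 13289.634; b1 <- 111.694   # estimates from summary(model1) above
sqft <- 1500                     # hypothetical above-ground living area
b0 + b1 * sqft                   # predicted sale price: about $180,831
# Equivalent, once model1 exists in the workspace:
# predict(model1, newdata = data.frame(GrLivArea = 1500))
```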

c


The R-squared is 49.95%, which says this variable explains about half the variance in sale price. The adjusted R-squared is nearly identical, which is expected since there are no other predictors in the model.

d


\(H_{0} : B_{1} = 0\)
\(H_{1} : B_{1} \neq 0\)
That is, the null hypothesis is that above-ground square footage does not affect sale price; the alternative is that it does.
The omnibus hypothesis is the same, as we are only using a single predictor.

Based on the ANOVA output, the F statistic is significant, so we reject the null hypothesis that the coefficient on above-grade square footage is 0.

anova(model1)
## Analysis of Variance Table
## 
## Response: SalePrice
##             Df        Sum Sq       Mean Sq F value                Pr(>F)
## GrLivArea    1 9337629924310 9337629924310  2922.6 < 0.00000000000000022
## Residuals 2928 9354907186041    3194981962

e


#a - predicted values of the model
model1_predicted_values <- model1$fitted.values
data_cont$predicted_values <- model1_predicted_values

#b - calculating the residuals
data_cont$residuals <- data_cont$SalePrice - data_cont$predicted_values

#standard deviation of the residuals
sd_residuals <- sd(data_cont$residuals)
#standardizing the residuals by dividing by their sample SD
data_cont$standardized_residuals <- data_cont$residuals/sd_residuals
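Dividing raw residuals by their sample SD, as above, is a simple approximation; base R also provides leverage-adjusted versions, and the two agree closely away from high-leverage points. A hedged sketch, using R's built-in `cars` data as a stand-in for model1:

```r
m <- lm(dist ~ speed, data = cars)            # stand-in for model1
simple   <- residuals(m) / sd(residuals(m))   # the approach used above
internal <- rstandard(m)                      # internally studentized residuals
external <- rstudent(m)                       # externally studentized (leave-one-out)
round(cbind(simple, internal, external)[1:3, ], 3)
```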

Histogram of Standardized Residuals

The standardized residuals look roughly normal in the histogram, centering around zero, but we do see observations at 5 or more standard deviations.

ggplot(data_cont, aes(x = standardized_residuals)) + geom_histogram(fill = "purple") + theme_classic() + labs(title = "Standardized Residuals", x = "standard deviations")

Residuals vs predicted values.

As the predicted values get higher, the residuals start to “fan out”. For the lower home prices we seem to predict well, but as the price gets higher, the difference between y and y_hat grows larger.
From an outliers-and-influential-points perspective, there are a few observations over $500,000. These can potentially be influential points that are driving the results.

ggplot(data_cont, aes(x = predicted_values, y = standardized_residuals)) +
  geom_point(col = "purple") +
  theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Standardized Residuals", y = "standardized residuals", x = "predicted Sale Price")

2

model2_data <- data[,c("OverallQual","SalePrice")]
model2 <-lm(SalePrice ~ OverallQual, data=model2_data)
summary(model2)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual, data = model2_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -197507  -29254   -2283   21658  397493 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)
## (Intercept) -95003.6     3933.8  -24.15 <0.0000000000000002
## OverallQual  45251.0      628.8   71.96 <0.0000000000000002
## 
## Residual standard error: 48020 on 2928 degrees of freedom
## Multiple R-squared:  0.6388, Adjusted R-squared:  0.6387 
## F-statistic:  5179 on 1 and 2928 DF,  p-value: < 0.00000000000000022

a

ggplot(data, aes(x = OverallQual, y = SalePrice)) + geom_point(col = "purple") + geom_smooth(method = "lm", fill = "red") + theme_classic() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

b

The model equation for this is \(Y = -95003.6 + 45251.0*X_{1}\).
Similar to the prior example, the intercept may not be informative here, but \(B_{1}\) says that each one-point increase in the overall quality rating is associated with a $45,251 increase in sale price.

The interpretation using overall quality should be a bit different. Since the predictor is not continuous, a per-unit interpretation may not be accurate. For model1, saying that each one-unit change in square footage changes the sale price by $x makes sense and is consistent across all square footages.
For overall quality, however, each one-point rating increase may not necessarily equate to the same dollar change in price, because the jump from a rating of 1 to 2 may differ from the jump from a rating of 7 to 8. So while the model says moving from rating 1 to rating 2 adds $45,251, that may not hold for moving from rating 9 to rating 10.
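One way to probe this concern is to treat the rating as categorical, so each level gets its own coefficient instead of one common slope. A hedged sketch (`model2_factor` is a name introduced here; a simulated stand-in is used if the Ames frame is not loaded):

```r
# Stand-in data if the Ames frame is absent (assumption: real analysis uses `data`).
d <- get0("data", ifnotfound = NULL)
if (!is.data.frame(d) || is.null(d$OverallQual)) {
  set.seed(2)
  d <- data.frame(OverallQual = sample(1:10, 400, replace = TRUE))
  d$SalePrice <- 40000 * d$OverallQual + rnorm(400, 0, 30000)
}
# Unordered factor: one estimated jump per rating level, relative to the lowest.
model2_factor <- lm(SalePrice ~ factor(OverallQual), data = d)
coef(model2_factor)
```

If the jumps between adjacent levels were really constant, these per-level coefficients would increase in roughly equal steps.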

C

The R-squared in this example says the variable explains about 64% of the variance. Again, this may be accurate overall, but the impact of each specific rating change is still unknown.

d

\(H_{0} : B_{1} = 0\)
\(H_{1} : B_{1} \neq 0\)
That is, the null hypothesis is that overall quality does not affect sale price; the alternative is that it does.
The omnibus hypothesis is the same, as we are only using a single predictor.

Based on the ANOVA output, the F statistic is significant, so we reject the null hypothesis that the coefficient on overall quality is 0.

anova(model2)
## Analysis of Variance Table
## 
## Response: SalePrice
##               Df         Sum Sq        Mean Sq F value                Pr(>F)
## OverallQual    1 11941155651186 11941155651186  5178.7 < 0.00000000000000022
## Residuals   2928  6751381459165     2305799679

e

For this (and future exercises), we will rely on lessR or other packages for the visuals.
Looking at the residuals vs fitted plot, we see a similar “fanning out” pattern, indicating heteroscedasticity, with potential outliers in the upper tail of the residuals and high-leverage points among the more expensive houses.

reg2<-reg(SalePrice~OverallQual,data=model2_data)

reg2
## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(SalePrice ~ OverallQual, data=model2_data, Rmd="eg")  
## 
## 
##   BACKGROUND
## 
## Data Frame:  model2_data 
##  
## Response Variable: SalePrice 
## Predictor Variable: OverallQual 
##  
## Number of cases (rows) of data:  2930 
## Number of cases retained for analysis:  2930 
## 
## 
##   BASIC ANALYSIS
## 
##               Estimate    Std Err  t-value  p-value     Lower 95%    Upper 95% 
## (Intercept) -95003.551   3933.822  -24.150    0.000   -102716.890   -87290.213 
## OverallQual  45251.028    628.805   71.964    0.000     44018.083    46483.973 
## 
## 
## Standard deviation of residuals:  48018.743 for 2928 degrees of freedom 
##  
## R-squared:  0.639    Adjusted R-squared:  0.639    PRESS R-squared:  0.638 
## 
## Null hypothesis that all population slope coefficients are 0:
##   F-statistic: 5178.748     df: 1 and 2928     p-value:  0.000 
## 
## 
##               df              Sum Sq             Mean Sq   F-value   p-value 
## Model          1  11941155651186.020  11941155651186.020  5178.748     0.000 
## Residuals   2928   6751381459165.417      2305799678.677 
## SalePrice   2929  18692537110351.438      6381883615.688 
## 
## 
##   K-FOLD CROSS-VALIDATION
## 
##   RELATIONS AMONG THE VARIABLES
## 
##               SalePrice OverallQual 
##     SalePrice      1.00        0.80 
##   OverallQual      0.80        1.00 
## 
## 
##   RESIDUALS AND INFLUENCE
## 
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance 
##    [sorted by Cook's Distance] 
##    [res_rows = 20, out of 2930 rows of data, or do res_rows="all"] 
## ------------------------------------------------------------------------ 
##        OverallQual  SalePrice     fitted       resid rstdnt dffits cooks 
##   1768          10     755000 357506.731  397493.269  8.388  0.457 0.102 
##   1761          10     745000 357506.731  387493.269  8.172  0.445 0.097 
##   2446          10     625000 357506.731  267493.269  5.608  0.305 0.046 
##   1064          10     615000 357506.731  257493.269  5.396  0.294 0.043 
##    433          10     610000 357506.731  252493.269  5.290  0.288 0.041 
##     45           9     611657 312255.702  299401.298  6.282  0.266 0.035 
##   1638           9     591587 312255.702  279331.298  5.855  0.248 0.030 
##   2451           9     584500 312255.702  272244.298  5.705  0.241 0.029 
##    434           9     582933 312255.702  270677.298  5.672  0.240 0.029 
##   1499          10     160000 357506.731 -197506.731 -4.130 -0.225 0.025 
##    424          10     555000 357506.731  197493.269  4.130  0.225 0.025 
##    457          10     552000 357506.731  194493.269  4.067  0.221 0.024 
##   2333           9     556581 312255.702  244325.298  5.115  0.216 0.023 
##   2331          10     545224 357506.731  187717.269  3.925  0.214 0.023 
##   2335          10     535000 357506.731  177493.269  3.710  0.202 0.020 
##   2181          10     183850 357506.731 -173656.731 -3.629 -0.198 0.019 
##   2182          10     184750 357506.731 -172756.731 -3.610 -0.197 0.019 
##   2904           1      81500 -49752.523  131252.523  2.743  0.190 0.018 
##     16           8     538000 267004.674  270995.326  5.676  0.176 0.015 
##    367           9     501837 312255.702  189581.298  3.962  0.168 0.014 
## 
## 
##   FORECASTING ERROR
## 
## Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals 
##    [sorted by lower bound of prediction interval] 
##    [to see all intervals do pred_rows="all"] 
## ------------------------------------------------------------------------------------ 
##        OverallQual  SalePrice       pred        sf      pi:lwr     pi:upr      width 
##    766           1      61000 -49752.523 48133.671 -144131.798  44626.752 188758.550 
##   1554           1      13100 -49752.523 48133.671 -144131.798  44626.752 188758.550 
## ... 
##   2929           5     170000 131251.590 48031.871   37071.921 225431.258 188359.337 
##      1           6     215000 176502.618 48026.974   82332.552 270672.684 188340.132 
##      3           6     172000 176502.618 48026.974   82332.552 270672.684 188340.132 
## ... 
##   2383          10     465000 357506.731 48089.671  263213.730 451799.731 188586.001 
##   2446          10     625000 357506.731 48089.671  263213.730 451799.731 188586.001 
##   2667          10     475000 357506.731 48089.671  263213.730 451799.731 188586.001 
## 
## 
## ----------------------------------------------------- 
## Plot 1: Distribution of Residuals 
## Plot 2: Residuals vs Fitted Values 
## Plot 3: Reg Line, Confidence and Prediction Intervals 
## -----------------------------------------------------

3

Overall, OverallQual has a higher R-squared, but I'm a bit hesitant about the result because it is not a continuous variable. Overall quality may explain more variance, but for quantifying impact (inference), it may not be as useful. If this were a purely predictive exercise, I would use OverallQual. If this were an inferential exercise to understand the impact of specific variables, I would rely on living-space square footage.

Part B

4

model3_data <-data[,c("OverallQual","GrLivArea", "SalePrice")]
model3 <- lm(SalePrice~., data=model3_data)

summary(model3)
## 
## Call:
## lm(formula = SalePrice ~ ., data = model3_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -393985  -23302   -1227   19949  283509 
## 
## Coefficients:
##                Estimate  Std. Error t value            Pr(>|t|)
## (Intercept) -109919.156    3420.959  -32.13 <0.0000000000000002
## OverallQual   33241.400     659.595   50.40 <0.0000000000000002
## GrLivArea        58.754       1.841   31.91 <0.0000000000000002
## 
## Residual standard error: 41370 on 2927 degrees of freedom
## Multiple R-squared:  0.732,  Adjusted R-squared:  0.7319 
## F-statistic:  3998 on 2 and 2927 DF,  p-value: < 0.00000000000000022

a

\(Y = -109919.156 + 33241.400*X_{1} + 58.754*X_{2}\)
The interpretation would be similar; we would ignore the intercept.
For every one-unit increase in overall quality (holding GrLivArea constant), sale price goes up $33,241; on the flip side, holding OverallQual constant, each additional square foot of living area increases sale price by $58.75. In this example, we speak to each variable's individual impact while accounting for the other. Interestingly, after adding overall quality to the model, the impact of square footage is much smaller. Again, though, we caveat the interpretation of OverallQual as before.

b

The R-squared indicates a better fit at 73.2%. Both R-squared and adjusted R-squared show it is a better fit than model 1, which had an R-squared of 49.95%, an improvement of about 23 percentage points.

c

\(H_{0} : B_{1} = 0\)
\(H_{1} : B_{1} \neq 0\)

\(H_{0} : B_{2} = 0\)
\(H_{1} : B_{2} \neq 0\)

\(H_{0}: B_{1}=B_{2}=0\): none of the variables has a relationship to sale price.

\(H_{1}:\) at least one \(B_{i} \neq 0\): at least one variable is related to SalePrice. Both variables are significant, so we reject the null in all cases.

anova((model3))
## Analysis of Variance Table
## 
## Response: SalePrice
##               Df         Sum Sq        Mean Sq F value                Pr(>F)
## OverallQual    1 11941155651186 11941155651186  6978.2 < 0.00000000000000022
## GrLivArea      1  1742658594840  1742658594840  1018.4 < 0.00000000000000022
## Residuals   2927  5008722864326     1711213825

d

reg3<-reg(SalePrice~OverallQual+GrLivArea,data=model3_data)

4e

Inspecting the plots, we still see issues with heteroscedasticity (although an improvement over the prior plot). As is, the variables do break the assumptions of linear regression, but we can continue working toward a possible transformation. We still see some outliers among the more expensive homes, so they will need to be addressed.

5

model4_data <-data[,c("SalePrice", "OverallQual","GrLivArea", "WoodDeckSF")]
model4 <- lm(SalePrice~., data=model4_data)
summary(model4)
## 
## Call:
## lm(formula = SalePrice ~ ., data = model4_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -388293  -23729   -1327   18918  293383 
## 
## Coefficients:
##                Estimate  Std. Error t value            Pr(>|t|)
## (Intercept) -106703.125    3389.802 -31.478 <0.0000000000000002
## OverallQual   32374.109     656.834  49.288 <0.0000000000000002
## GrLivArea        56.519       1.831  30.873 <0.0000000000000002
## WoodDeckSF       57.838       6.221   9.297 <0.0000000000000002
## 
## Residual standard error: 40780 on 2926 degrees of freedom
## Multiple R-squared:  0.7397, Adjusted R-squared:  0.7395 
## F-statistic:  2772 on 3 and 2926 DF,  p-value: < 0.00000000000000022

a

The equation for this is \(Y = -106703 + 32374X_{1} + 56.519X_{2} + 57.838X_{3}\).
Similar to the prior examples, all variables are significant, and overall quality has the largest impact on sale price. The new WoodDeckSF variable has a per-square-foot impact on sale price similar to GrLivArea's. It is possible that square footage in general does not have the greatest per-unit impact on sale price.

b

Surprisingly, the R-squared did not improve much: it is 73.97%, an improvement of only about 0.7 percentage points in the variance explained. Looking at the coefficients, which are similar in size, it may be that square footage does not have a strong per-unit impact on sale price, and WoodDeckSF may be correlated with another predictor.
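The suspicion that WoodDeckSF overlaps with the other predictors can be checked directly with the pairwise correlations among the three predictors. A hedged sketch (a simulated stand-in is built if `model4_data` is not in the workspace):

```r
# Stand-in data if model4_data is absent (assumption: real analysis uses the Ames frame).
if (!exists("model4_data") || !is.data.frame(model4_data)) {
  set.seed(4)
  model4_data <- data.frame(GrLivArea = rnorm(300, 1500, 400))
  model4_data$WoodDeckSF  <- pmax(0, 50 + 0.05 * model4_data$GrLivArea + rnorm(300, 0, 60))
  model4_data$OverallQual <- pmin(10, pmax(1, round(model4_data$GrLivArea / 300)))
}
# Correlations among the predictors only; large off-diagonal values would
# signal redundancy between WoodDeckSF and the other two.
round(cor(model4_data[, c("OverallQual", "GrLivArea", "WoodDeckSF")]), 2)
```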

c

\(H_{0} : B_{1} = 0\)
\(H_{1} : B_{1} \neq 0\)

\(H_{0} : B_{2} = 0\)
\(H_{1} : B_{2} \neq 0\)

\(H_{0} : B_{3} = 0\)
\(H_{1} : B_{3} \neq 0\)

\(H_{0}: B_{1}=B_{2}=B_{3}=0\): none of the variables has a relationship to sale price.

\(H_{1}:\) at least one \(B_{i} \neq 0\): at least one variable is related to SalePrice. All 3 variables are significant, so we reject the null in all cases.

anova(model4)
## Analysis of Variance Table
## 
## Response: SalePrice
##               Df         Sum Sq        Mean Sq  F value                Pr(>F)
## OverallQual    1 11941155651186 11941155651186 7181.842 < 0.00000000000000022
## GrLivArea      1  1742658594840  1742658594840 1048.098 < 0.00000000000000022
## WoodDeckSF     1   143700943544   143700943544   86.427 < 0.00000000000000022
## Residuals   2926  4865021920782     1662686918

d

Using the new variable WoodDeckSF, we actually don't see much improvement in the residuals vs predicted plot. So while we reject the null hypothesis, based on the residuals vs fitted, we might not use WoodDeckSF as a variable if we had a second attempt at this.

reg3<-reg(SalePrice~OverallQual + GrLivArea + WoodDeckSF, data = model4_data)

Part C

Plots of the log model below

LNmodel1 <- data_cont[,c("SalePrice","GrLivArea")]
LNmodel1$LogSalePrice <- log(LNmodel1$SalePrice)
LNreg1 <- reg(LogSalePrice~GrLivArea, data=LNmodel1)

LNmodel3 <- model3_data[,c("SalePrice","GrLivArea","OverallQual")]
LNmodel3$LogSalePrice <- log(LNmodel3$SalePrice)
LNreg3 <- reg(LogSalePrice~GrLivArea+OverallQual, data=LNmodel3)

LNmodel4 <- model4_data[,c("SalePrice","GrLivArea","OverallQual", "WoodDeckSF")]
LNmodel4$LogSalePrice <- log(LNmodel4$SalePrice)
LNreg4 <- reg(LogSalePrice~GrLivArea+OverallQual+WoodDeckSF, data=LNmodel4)

6

Comparing across models (see below), we see the most improvement in Model4 after transformation. Prior to transformation, Model4 explains 74% of the variance; after transformation, that rises about 3 percentage points to 77%. This suggests the log transform does help. Looking at the root mean squared error, we can't compare across models since the log model is in different units; however, it does illustrate how the residual standard error decreases as we add more variables.

df <- data.frame("Model"=c("Model1","Model3","Model4", "Model1_log", "Model3_log", "Model4_log"), 
                 "R-Squared"=c(.49,.73,.74,.48,.75,.77), "Adj R-Squared"=c(.49,.73,.74,.48,.76,.77), "RMSE"=c(237.7,203,201.9,.53,.44,.44))
df
##        Model R.Squared Adj.R.Squared   RMSE
## 1     Model1      0.49          0.49 237.70
## 2     Model3      0.73          0.73 203.00
## 3     Model4      0.74          0.74 201.90
## 4 Model1_log      0.48          0.48   0.53
## 5 Model3_log      0.75          0.76   0.44
## 6 Model4_log      0.77          0.77   0.44

7

The difference in interpretation for a transformed response variable is that the units are on a logarithmic scale, so predictions need to be transformed back to the original scale once model adjustments are complete. Depending on the context, if we are talking about prediction (the final result), the scale shouldn't matter, but if we are explaining effects, we would need to undo the log to speak to a non-technical audience.
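Concretely, with log(SalePrice) as the response, a slope \(b\) corresponds to roughly a \(100(e^{b}-1)\%\) change in price per unit of the predictor, and `exp()` back-transforms predictions to dollars. A worked example using the OverallQual slope from the trimmed model in Part D:

```r
b_qual <- 0.167840651     # OverallQual slope on log(SalePrice), from Part D below
(exp(b_qual) - 1) * 100   # about an 18.3% higher price per rating point
# Back-transforming a prediction: exp() turns log-dollars into dollars.
exp(12.0)                 # a log-price of 12 is about $162,755
```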

Part D

8

LNmodel4 <- model4_data[,c("SalePrice","GrLivArea","OverallQual", "WoodDeckSF")]
LNmodel4$LogSalePrice <-log(LNmodel4$SalePrice)
LN_reg4<-lm(LogSalePrice~GrLivArea+OverallQual+WoodDeckSF,data=LNmodel4)

t1<-ols_plot_dffits(LN_reg4)

n=2930
p=3
threshold = 2*sqrt((p+1)/(n-p-1))
threshold
## [1] 0.07394739
dffitsData <- t1$data
LNmodel4$Dffits_indicator <-dffitsData$dbetas

updatedData <- LNmodel4[which(LNmodel4$Dffits_indicator <= threshold & LNmodel4$Dffits_indicator >= -threshold),]
str(updatedData)
## 'data.frame':    2792 obs. of  6 variables:
##  $ SalePrice       : int  215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
##  $ GrLivArea       : int  1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
##  $ OverallQual     : int  6 5 6 7 5 6 8 8 8 7 ...
##  $ WoodDeckSF      : int  210 140 393 0 212 360 0 0 237 140 ...
##  $ LogSalePrice    : num  12.3 11.6 12.1 12.4 12.2 ...
##  $ Dffits_indicator: num  0.026475 -0.019653 0.000818 0.016757 0.045278 ...
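The same influence screen can also be done without olsrr, since base R provides `dffits()` directly. A hedged sketch using the built-in `cars` data as a stand-in for the Ames model:

```r
m <- lm(dist ~ speed, data = cars)        # stand-in for LN_reg4
n <- nrow(cars)
p <- length(coef(m)) - 1                  # number of predictors, excluding intercept
cutoff <- 2 * sqrt((p + 1) / (n - p - 1)) # same rule of thumb as above
keep <- abs(dffits(m)) <= cutoff          # TRUE for non-influential rows
sum(!keep)                                # how many points the screen would drop
```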

With the updated data, we see an improvement in R-squared to 82.8%, about 7 percentage points. This occurs after removing about 138 influential points, roughly 5% of the data.

Removing the influential points appears to be an acceptable action because of the improvement, and because it affects only about 5% of the total sample. Keeping in mind that we are still trying to predict a ‘typical’ home price, using 95% of the data still tells the story.

UpdatdLN_reg4<-lm(LogSalePrice~GrLivArea+OverallQual+WoodDeckSF,data=updatedData)
summary(UpdatdLN_reg4)
## 
## Call:
## lm(formula = LogSalePrice ~ GrLivArea + OverallQual + WoodDeckSF, 
##     data = updatedData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.55709 -0.10070  0.00867  0.10878  0.53464 
## 
## Coefficients:
##                 Estimate   Std. Error t value            Pr(>|t|)
## (Intercept) 10.544122065  0.013742501  767.26 <0.0000000000000002
## GrLivArea    0.000290128  0.000007699   37.68 <0.0000000000000002
## OverallQual  0.167840651  0.002685298   62.50 <0.0000000000000002
## WoodDeckSF   0.000336095  0.000025997   12.93 <0.0000000000000002
## 
## Residual standard error: 0.1561 on 2788 degrees of freedom
## Multiple R-squared:  0.8283, Adjusted R-squared:  0.8281 
## F-statistic:  4484 on 3 and 2788 DF,  p-value: < 0.00000000000000022

9

Approach

  1. Going back to the beginning, I’ll check the \(R^2\) and scatter plots for correlation (see appendix) and remove the variables that do not appear correlated with sale price.
  2. I’ll then look at the correlations among the predictors and remove any predictors that are correlated with each other.
  3. Then I’ll scale my predictor variables and run the model.
  4. Then I’ll use my model results and look at the potential to remove outliers.

Looking at \(R^2\)

Based on my plots I will consider:
FirstFlrSF, GrLivArea, WoodDeckSF, LotArea, SecondFlrSF, and OpenPorchSF, as those visually and numerically show correlation to sale price. Because the \(R^2\) is low overall, most of the variables I chose are still “low”.

Of these, I’m keeping GrLivArea, WoodDeckSF, LotArea, and OpenPorchSF, as they have lower correlation coefficients with each other. It also appears to make sense, as I’m choosing characteristics for different portions of the home (living area, deck, land, and porch).

corr <-cor(data_cont)
ggcorrplot(corr,
           type ="lower",
           lab = TRUE,
           lab_size=3,
           method = "circle",
           colors = c("red", "white", "purple"),
           title = "Correlogram of Housing prices",
           ggtheme = theme_classic)

Adding to model 4:

model5_data <- data[,c("SalePrice","GrLivArea","WoodDeckSF","LotArea","OpenPorchSF", "OverallQual")]
model5_data$LogSalePrice <- log(model5_data$SalePrice)

Linear regression on the raw data.

The model explains 77.4% of the variance (\(R^{2}\)).

model5_raw <-lm(LogSalePrice~GrLivArea+WoodDeckSF+LotArea+OpenPorchSF+OverallQual,data=model5_data)
summary(model5_raw)
## 
## Call:
## lm(formula = LogSalePrice ~ GrLivArea + WoodDeckSF + LotArea + 
##     OpenPorchSF + OverallQual, data = model5_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02276 -0.09395  0.01784  0.12139  0.62136 
## 
## Coefficients:
##                  Estimate    Std. Error t value            Pr(>|t|)
## (Intercept) 10.4937466507  0.0166185521 631.448 <0.0000000000000002
## GrLivArea    0.0002261516  0.0000092421  24.470 <0.0000000000000002
## WoodDeckSF   0.0002791945  0.0000298244   9.361 <0.0000000000000002
## LotArea      0.0000050654  0.0000004788  10.579 <0.0000000000000002
## OpenPorchSF  0.0001431600  0.0000571420   2.505              0.0123
## OverallQual  0.1810833272  0.0031714448  57.098 <0.0000000000000002
## 
## Residual standard error: 0.1939 on 2924 degrees of freedom
## Multiple R-squared:  0.7742, Adjusted R-squared:  0.7738 
## F-statistic:  2005 on 5 and 2924 DF,  p-value: < 0.00000000000000022
anova(model5_raw)
## Analysis of Variance Table
## 
## Response: LogSalePrice
##               Df  Sum Sq Mean Sq  F value                Pr(>F)
## GrLivArea      1 235.617 235.617 6269.898 < 0.00000000000000022
## WoodDeckSF     1  13.163  13.163  350.288 < 0.00000000000000022
## LotArea        1   0.930   0.930   24.755          0.0000006889
## OpenPorchSF    1   4.480   4.480  119.203 < 0.00000000000000022
## OverallQual    1 122.515 122.515 3260.188 < 0.00000000000000022
## Residuals   2924 109.881   0.038

Scale and center the data

As expected, scaling and centering the predictors did not change the \(R^2\) (77.4%): it is a linear transformation, so the fit is identical and only the coefficient units change. Because of this, I think using the original units makes more sense from an interpretability standpoint.
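A quick, self-contained demonstration of why scaling leaves the fit unchanged, using R's built-in `cars` data as a stand-in:

```r
m_raw    <- lm(dist ~ speed, data = cars)          # predictor in original units
m_scaled <- lm(dist ~ scale(speed), data = cars)   # centered and scaled predictor
# R-squared is invariant under linear transformation of the predictor.
c(summary(m_raw)$r.squared, summary(m_scaled)$r.squared)
```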

scaled_data_x <- model5_data[,c("GrLivArea","WoodDeckSF","LotArea","OpenPorchSF", "OverallQual")]
y <- model5_data[,"LogSalePrice"]

model5_data_scaled <- data.frame(lapply(scaled_data_x, function(x) scale(x,center=TRUE, scale=TRUE)))

final_data<-cbind(y,model5_data_scaled)
final_reg <- reg(y~GrLivArea+WoodDeckSF+LotArea+OpenPorchSF+OverallQual,data=final_data)

final_reg
## >>> Suggestion
## # Create an R markdown file for interpretative output with  Rmd = "file_name"
## reg(y ~ GrLivArea + WoodDeckSF + LotArea + OpenPorchSF + OverallQual, data=final_data, Rmd="eg")  
## 
## 
##   BACKGROUND
## 
## Data Frame:  final_data 
##  
## Response Variable: y 
## Predictor Variable 1: GrLivArea 
## Predictor Variable 2: WoodDeckSF 
## Predictor Variable 3: LotArea 
## Predictor Variable 4: OpenPorchSF 
## Predictor Variable 5: OverallQual 
##  
## Number of cases (rows) of data:  2930 
## Number of cases retained for analysis:  2930 
## 
## 
##   BASIC ANALYSIS
## 
##               Estimate    Std Err  t-value  p-value    Lower 95%    Upper 95% 
## (Intercept) 12.0209687  0.0035813 3356.605    0.000   12.0139466   12.0279908 
##   GrLivArea  0.1143217  0.0046720   24.470    0.000    0.1051610    0.1234823 
##  WoodDeckSF  0.0352795  0.0037687    9.361    0.000    0.0278900    0.0426689 
##     LotArea  0.0399156  0.0037731   10.579    0.000    0.0325173    0.0473139 
## OpenPorchSF  0.0096609  0.0038561    2.505    0.012    0.0020999    0.0172219 
## OverallQual  0.2555133  0.0044750   57.098    0.000    0.2467388    0.2642878 
## 
## 
## Standard deviation of residuals:  0.1938532 for 2924 degrees of freedom 
##  
## R-squared:  0.774    Adjusted R-squared:  0.774    PRESS R-squared:  0.771 
## 
## Null hypothesis that all population slope coefficients are 0:
##   F-statistic: 2004.866     df: 5 and 2924     p-value:  0.000 
## 
## 
##                   df       Sum Sq      Mean Sq       F-value   p-value 
##   GrLivArea        1  235.6169427  235.6169427  6269.8976109     0.000 
##  WoodDeckSF        1   13.1634836   13.1634836   350.2876029     0.000 
##     LotArea        1    0.9302501    0.9302501    24.7544721     0.000 
## OpenPorchSF        1    4.4795301    4.4795301   119.2027827     0.000 
## OverallQual        1  122.5148388  122.5148388  3260.1878549     0.000 
## Model              5  376.7050453   75.3410091  2004.8660647     0.000 
##  
## Residuals       2924  109.8812107    0.0375791 
##  
## y               2929  486.5862561    0.1661271 
## 
## 
##   K-FOLD CROSS-VALIDATION
## 
##   RELATIONS AMONG THE VARIABLES
## 
##                  y GrLivArea WoodDeckSF LotArea OpenPorchSF OverallQual 
##             y 1.00      0.70       0.33    0.26        0.32        0.83 
##     GrLivArea 0.70      1.00       0.25    0.29        0.34        0.57 
##    WoodDeckSF 0.33      0.25       1.00    0.16        0.04        0.26 
##       LotArea 0.26      0.29       0.16    1.00        0.10        0.10 
##   OpenPorchSF 0.32      0.34       0.04    0.10        1.00        0.30 
##   OverallQual 0.83      0.57       0.26    0.10        0.30        1.00 
## 
## 
##               Tolerance       VIF 
##     GrLivArea     0.588     1.701 
##    WoodDeckSF     0.903     1.107 
##       LotArea     0.901     1.110 
##   OpenPorchSF     0.863     1.159 
##   OverallQual     0.641     1.561 
## 
## 
##  GrLivArea WoodDeckSF LotArea OpenPorchSF OverallQual    R2adj    X's 
##          1          1       1           1           1    0.774      5 
##          1          1       1           0           1    0.773      4 
##          1          0       1           1           1    0.767      4 
##          1          0       1           0           1    0.767      3 
##          1          1       0           1           1    0.765      4 
##          1          1       0           0           1    0.765      3 
##          1          0       0           1           1    0.757      3 
##          1          0       0           0           1    0.756      2 
##          0          1       1           1           1    0.728      4 
##          0          1       1           0           1    0.723      3 
##          0          0       1           1           1    0.716      3 
##          0          0       1           0           1    0.712      2 
##          0          1       0           1           1    0.704      3 
##          0          1       0           0           1    0.697      2 
##          0          0       0           1           1    0.687      2 
##          0          0       0           0           1    0.682      1 
##          1          1       1           1           0    0.522      4 
##          1          1       0           1           0    0.520      3 
##          1          1       1           0           0    0.513      3 
##          1          1       0           0           0    0.511      2 
##          1          0       1           1           0    0.495      3 
##          1          0       0           1           0    0.492      2 
##          1          0       1           0           0    0.487      2 
##          1          0       0           0           0    0.484      1 
##          0          1       1           1           0    0.235      3 
##          0          1       0           1           0    0.205      2 
##          0          1       1           0           0    0.153      2 
##          0          0       1           1           0    0.152      2 
##          0          1       0           0           0    0.111      1 
##          0          0       0           1           0    0.102      1 
##          0          0       1           0           0    0.065      1 
##  
## [based on Thomas Lumley's leaps function from the leaps package] 
##  
## 
## 
##   RESIDUALS AND INFLUENCE
## 
## Data, Fitted, Residual, Studentized Residual, Dffits, Cook's Distance 
##    [sorted by Cook's Distance] 
##    [res_rows = 20, out of 2930 rows of data, or do res_rows="all"] 
## --------------------------------------------------------------------------------------------------------------------------------- 
##         GrLivArea WoodDeckSF    LotArea OpenPorchSF OverallQual          y     fitted      resid      rstdnt     dffits     cooks 
##   1499  8.1943358  0.9516195  6.8196646   3.6226176   2.7675742 11.9829291 14.0056921 -2.0227630 -10.8192794 -1.9969220 0.6392400 
##    957  1.0609300 -0.7419335 26.0274893  -0.7043724   0.6414619 12.8346813 13.3120807 -0.4773994  -2.8568330 -1.6717325 0.4646400 
##   2181  7.1122579  3.5790007  3.6982249   6.4677617   2.7675742 12.1218755 13.8775724 -1.7556969  -9.3208180 -1.5852155 0.4068700 
##   2182  6.2833901  0.9041367  3.8002552   5.3119219   2.7675742 12.1267588 13.6813532 -1.5545944  -8.1911686 -1.1649043 0.2211700 
##   1571  0.5663789  4.5603118 19.6080876  -0.7043724  -0.7759464 12.3412589 12.8242023 -0.4829433  -2.6953089 -1.1047342 0.2029700 
##    727 -1.5423872 -0.7419335 -0.2879336   7.0456816  -1.4846505 10.4602421 11.4956922 -1.0354501  -5.4343615 -0.8610147 0.1223600 
##   2116  1.2745761 -0.7419335 18.8898151   0.7478365  -0.0672422 12.5317728 12.8845470 -0.3527742  -1.9539431 -0.7611590 0.0964700 
##   1554 -1.5166706 -0.7419335  0.5629528  -0.7043724  -3.6107628  9.4803675 10.9144732 -1.4341057  -7.4865501 -0.5392086 0.0475600 
##    182 -1.3208283 -0.7419335 -0.0624265  -0.7043724  -2.9020587  9.4563407 11.0929831 -1.6366424  -8.5604587 -0.4911544 0.0392400 
##   2072  0.6415507  2.2653101 13.3249799   0.0069136   0.6414619 12.6181823 12.8700740 -0.2518917  -1.3444756 -0.3564682 0.0211700 
##   1611  0.6652891 -0.7419335  5.8949205  -0.7043724  -0.7759464 11.5424843 12.1010804 -0.5585961  -2.9054323 -0.3446331 0.0197500 
##   1183  2.8571398  0.3659983  1.8304626   0.3329197   2.0588701 11.9183906 12.9628629 -1.0444723  -5.4248814 -0.3427887 0.0194000 
##   1403  0.3230597  1.0307575  5.4668758   1.3257564  -1.4846505 12.4529327 11.9459396  0.5069931   2.6350974  0.3016304 0.0151300 
##   2044 -0.0310389 -0.6311403  0.0218373   2.4223224  -1.4846505 10.8197783 11.6400796 -0.8203013  -4.2529152 -0.2772820 0.0127400 
##   1321  2.5228232  0.0890154  0.4253897   6.7344940   2.7675742 12.6915805 13.1017156 -0.4101351  -2.1342588 -0.2732020 0.0124200 
##   2911  2.0203592  5.0193122 -0.7847599  -0.4080033  -0.0672422 11.9276806 12.3765710 -0.4488903  -2.3318266 -0.2608144 0.0113200 
##   2196  2.1608118 -0.7419335  0.0941721  10.2909242  -0.7759464 11.9183906 12.1467354 -0.2283448  -1.2039219 -0.2538657 0.0107400 
##   2294 -0.2842491 10.5273162  0.7371910  -0.7043724  -0.0672422 12.1441972 12.3653101 -0.2211128  -1.1654485 -0.2441374 0.0099300 
##    754  1.2429248 -0.7419335 -0.9070946   0.1254613  -0.7759464 11.0429218 11.9036271 -0.8607053  -4.4607614 -0.2401771 0.0095500 
##   1570  0.3744930  2.6451725 -0.0200408   1.0738426  -1.4846505 12.3392915 11.7873280  0.5519635   2.8590746  0.2177606 0.0078800 
## 
## 
##   FORECASTING ERROR
## 
## Data, Predicted, Standard Error of Forecast, 95% Prediction Intervals 
##    [sorted by lower bound of prediction interval] 
##    [to see all intervals do pred_rows="all"] 
## ------------------------------------------------------------------------------------------------------------------------------- 
##         GrLivArea WoodDeckSF    LotArea OpenPorchSF OverallQual          y       pred        sf     pi:lwr     pi:upr     width 
##   1902 -2.3059742 -0.7419335 -0.6532881  -0.7043724  -3.6107628 10.5789798 10.7756917 0.1943320 10.3946503 11.1567331 0.7620829 
##   1554 -1.5166706 -0.7419335  0.5629528  -0.7043724  -3.6107628  9.4803675 10.9144732 0.1943528 10.5333910 11.2955554 0.7621644 
##    766 -1.1783976 -0.7419335 -0.0162337  -0.7043724  -3.6107628 11.0186291 10.9300265 0.1943562 10.5489376 11.3111155 0.7621779 
## ... 
##    789  0.3744930 -0.7419335 -0.0371727  -0.7043724  -0.0672422 11.7558716 12.0121364 0.1939446 11.6318545 12.3924183 0.7605637 
##   1729  0.0718277  0.0494464  0.0113043  -0.1412710  -0.0672422 12.0782393 12.0128297 0.1938880 11.6326589 12.3930005 0.7603416 
##    162 -0.5453721  2.5343793 -0.2644565  -0.7043724  -0.0672422 11.7905572 12.0134903 0.1941579 11.6327901 12.3941904 0.7614003 
## ... 
##   2451  3.9570215  6.1430716  0.9002617   0.5700150   2.0588701 13.2785121 13.2575762 0.1953979 12.8744448 13.6407075 0.7662627 
##    957  1.0609300 -0.7419335 26.0274893  -0.7043724   0.6414619 12.8346813 13.3120807 0.2171743 12.8862506 13.7379109 0.8516603 
##   2667  4.1706676 -0.7419335  1.6246255   3.1484269   2.7675742 13.0710701 13.2740079 0.1946960 12.8922528 13.6557630 0.7635102 
## 
## 
## ---------------------------------- 
## Plot 1: Distribution of Residuals 
## Plot 2: Residuals vs Fitted Values 
## Plot 3: ScatterPlot Matrix 
## ----------------------------------

Using the non-scaled data, the equation is \(Y = 10.493 + .00226X_{1} + .00226X_{2} + .00226X_{3} + .00226X_{4} + .00226X_{5}\). Interpreting the coefficients directly is not recommended, since the response is log-transformed and we would need to undo the log first, so we will not interpret the coefficients here, but we do see significance in all of the variables. We are also able to reject the null hypothesis for each t-statistic and for the omnibus test.
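If we did want to undo the log, a coefficient \(b\) on a log-response model can be read as an approximate percent change in SalePrice via \(100(e^{b}-1)\). A minimal sketch (the helper name `pct_effect` is mine, not from the original analysis):

```r
# Convert a log-response regression coefficient into an approximate
# percent change in SalePrice per one-unit increase in the predictor.
pct_effect <- function(b) (exp(b) - 1) * 100

pct_effect(0.00226)   # small square-footage coefficients: roughly 0.23% per unit
pct_effect(0.1687)    # a one-point jump in OverallQual: about 18.4%
```

This is why "significant but tiny" coefficients on square-footage variables are still meaningful: the effect compounds over hundreds of square feet.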

Looking at the residual plots:

Residuals vs. leverage displays a few points outside of Cook's distance, which is an indication of high-leverage, influential observations. Also, at least between 0.00 and 0.05, the spread of the standardized residuals does appear to decrease a bit.

The normal Q-Q plot suggests that the residuals are not normally distributed.

Scale-location shows the spread of the residuals across the range of fitted values. We see a reasonably straight line with randomly scattered points, so I lean toward this satisfying homoscedasticity.

Residuals vs. fitted shows a fairly straight line and random points, which suggests there are no non-linear relationships.

plot(model5_raw)
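The same diagnostics can also be pulled out numerically rather than read off the plots. A minimal sketch using the built-in mtcars data as a stand-in model (the same calls apply to `model5_raw`):

```r
# Fit a small stand-in model; substitute model5_raw in the actual analysis.
m <- lm(mpg ~ wt + hp, data = mtcars)

rs <- rstudent(m)          # studentized residuals (feeds the normal Q-Q check)
cd <- cooks.distance(m)    # Cook's distance (residuals-vs-leverage check)

# Rows with the largest influence, analogous to lessR's sorted table above:
head(sort(cd, decreasing = TRUE), 3)
```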

Now we run DFFITS again on this data and remove the outliers.

t2<-ols_plot_dffits(model5_raw)

n <- 2930
p <- 5
# DFFITS cutoff 2*sqrt((k+1)/(n-k-1)); computed here with k = 3
threshold <- 2*sqrt((3+1)/(2930-3-1))
threshold
## [1] 0.07394739
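The cutoff formula can be wrapped in a small helper (the function name `dffits_cutoff` is mine). With k = 3 it reproduces the 0.0739 value above; with all k = 5 predictors of this model it would be about 0.091:

```r
# DFFITS size-adjusted cutoff: 2 * sqrt((k + 1) / (n - k - 1)),
# where n is the number of observations and k the number of predictors.
dffits_cutoff <- function(n, k) 2 * sqrt((k + 1) / (n - k - 1))

dffits_cutoff(2930, 3)   # 0.07394739, the threshold used above
dffits_cutoff(2930, 5)   # ~0.0906, the value implied by k = 5 predictors
```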

Model refit:

dffitsData <- t2$data
model5_data$Dffits_indicator <-dffitsData$dbetas

updatedData5 <- model5_data[which(model5_data$Dffits_indicator <= threshold & model5_data$Dffits_indicator >= -threshold),]

str(updatedData5)
## 'data.frame':    2734 obs. of  8 variables:
##  $ SalePrice       : int  215000 105000 172000 244000 189900 195500 213500 191500 236500 189000 ...
##  $ GrLivArea       : int  1656 896 1329 2110 1629 1604 1338 1280 1616 1804 ...
##  $ WoodDeckSF      : int  210 140 393 0 212 360 0 0 237 140 ...
##  $ LotArea         : int  31770 11622 14267 11160 13830 9978 4920 5005 5389 7500 ...
##  $ OpenPorchSF     : int  62 0 36 0 34 36 0 82 152 60 ...
##  $ OverallQual     : int  6 5 6 7 5 6 8 8 8 7 ...
##  $ LogSalePrice    : num  12.3 11.6 12.1 12.4 12.2 ...
##  $ Dffits_indicator: num  0.02761 -0.02369 -0.00338 0.02404 0.04524 ...
updatedData5_reg5<-lm(LogSalePrice~GrLivArea+WoodDeckSF+LotArea+OpenPorchSF+OverallQual,data=updatedData5)
summary(updatedData5_reg5)
## 
## Call:
## lm(formula = LogSalePrice ~ GrLivArea + WoodDeckSF + LotArea + 
##     OpenPorchSF + OverallQual, data = updatedData5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.48807 -0.09094  0.00701  0.10065  0.48105 
## 
## Coefficients:
##                  Estimate    Std. Error t value             Pr(>|t|)
## (Intercept) 10.4970767479  0.0133353759 787.160 < 0.0000000000000002
## GrLivArea    0.0002409056  0.0000074306  32.421 < 0.0000000000000002
## WoodDeckSF   0.0002834005  0.0000237375  11.939 < 0.0000000000000002
## LotArea      0.0000109425  0.0000006258  17.485 < 0.0000000000000002
## OpenPorchSF  0.0002985507  0.0000495421   6.026         0.0000000019
## OverallQual  0.1687696610  0.0025012476  67.474 < 0.0000000000000002
## 
## Residual standard error: 0.1396 on 2728 degrees of freedom
## Multiple R-squared:  0.8595, Adjusted R-squared:  0.8592 
## F-statistic:  3337 on 5 and 2728 DF,  p-value: < 0.00000000000000022

The updated \(R^{2}\) is roughly 0.86 (86%), up from the 0.774 adjusted \(R^{2}\) of the five-variable fit before outlier removal.

Conclusion

In general, transformation and data removal appear to help overall model performance. While a benefit on the surface, the tradeoff is interpretability: as we log-transform the data and remove data points, the coefficients become harder to read directly. I was surprised, though, that scaling did not change the results much.

The hypothesis tests are a good guide, but as I dug deeper, I feel they bring some subjectivity into how to pick model variables. For example, while a statistic may say a variable is significant, the scatterplot may tempt me not to pick it, or vice versa. So far, every variable I picked came out statistically significant, which raises a bit of "too good to be true" doubt.

As for next steps, an automated variable selection method would be the logical way to take over some of the manual work. My variable selections were chosen purely from a high-level look, so having an automated way to compare the \(R^{2}\) of every predictor combination would help.
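As a sketch of that next step, base R's `step()` runs stepwise AIC-based selection (shown here on the built-in mtcars data as a stand-in; for the housing data the starting model would be the full `LogSalePrice` regression instead):

```r
# Stepwise selection by AIC on a stand-in dataset. step() adds/drops
# predictors until no single change improves the AIC.
full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
best <- step(full, direction = "both", trace = 0)
formula(best)   # the retained predictor set
```

Note that the lessR best-subsets table earlier is itself built on the leaps package, so `leaps::regsubsets()` is another option for an exhaustive all-subsets \(R^{2}\) comparison rather than a stepwise search.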

Appendix

#### Initial histogram chart of continuous variables
Some are normally distributed while others are skewed. While independent variables are not assumed to be normal in a linear regression model, the skewness can tell us a bit about potential outliers and influential observations.

options(scipen = 999)
for (i in seq_along(data_cont)) {
  print(
    ggplot(data_cont, aes(x = .data[[names(data_cont)[i]]])) +
      geom_histogram(fill = "purple") +
      theme_classic() +
      labs(x = names(data_cont)[i],
           y = "Frequency",
           title = paste0("Frequency of ", names(data_cont)[i]))
  )
}

### Scatter plot of continuous variables vs. SalePrice
The scatter plots let us visually assess correlation with sale price. Anything that does not look correlated may be a candidate for removal.
***

for (i in seq_along(data_cont)) {
  print(
    ggplot(data_cont, aes(x = .data[[names(data_cont)[i]]], y = SalePrice)) +
      geom_point(col = "purple") +
      geom_smooth(method = "lm", fill = "red") +
      theme_classic() +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      labs(x = names(data_cont)[i],
           y = "Sale Price",
           title = paste0("Scatterplot of ", names(data_cont)[i]))
  )
}

### Correlation matrix
***

The correlation matrix confirms our scatter plots and shows which variables have the highest correlation coefficients. A high correlation with sale price may be an indication of variable importance.

corr <-cor(data_cont)
ggcorrplot(corr,
           type ="lower",
           lab = TRUE,
           lab_size=3,
           method = "circle",
           colors = c("red", "white", "purple"),
           title = "Correlogram of Housing prices",
           ggtheme = theme_classic)
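To rank the predictors directly, the SalePrice column of the correlation matrix can be sorted. A sketch using a tiny stand-in data frame (values taken from the first rows of the `str()` output above); with the real data, pairwise-complete observations guard against the NAs in columns like LotFrontage and MasVnrArea:

```r
# Stand-in for data_cont; with the real data, use
# cor(data_cont, use = "pairwise.complete.obs") instead.
df <- data.frame(
  GrLivArea = c(1656, 896, 1329, 2110, 1629),
  LotArea   = c(31770, 11622, 14267, 11160, 13830),
  SalePrice = c(215000, 105000, 172000, 244000, 189900)
)
corr <- cor(df, use = "pairwise.complete.obs")
sort(corr[, "SalePrice"], decreasing = TRUE)   # SalePrice first (r = 1)
```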

### Q-Q plot
The Q-Q plot reinforces the histogram view and shows which variables are approximately normal.

for (i in seq_along(data_cont)) {
  if (is.integer(data_cont[[i]])) {
    print(
      ggplot(data_cont, aes(sample = .data[[names(data_cont)[i]]])) +
        stat_qq(col = "purple") +
        stat_qq_line(col = "red") +
        theme_classic() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        labs(x = "theoretical",
             y = "sample",
             title = paste0("Q-Q to test for Normality of ", names(data_cont)[i]))
    )
  }
}